DATA1220-55, Fall 2024
2024-10-02
Why can we use sample statistics (e.g. sample mean \(\bar{x}\), sample standard deviation \(s\)) as point estimates for the population parameters (e.g. population mean \(\mu\), population standard deviation \(\sigma\))?
Statistical analysis requires making assumptions about the world around us which may or may not be true.
ASSUMPTION
The probability distribution of a random process follows a known distribution (e.g. a normal distribution), which we can model and from which we can draw inferences about the parameters which govern that process.
ASSUMPTION
We have collected enough data and that data is trustworthy enough that our sample statistics are reliable estimators of the “ground truth” in our sample population.
ASSUMPTION
Our sample population is sufficiently representative of our study population such that our sample statistics are valid estimators of the population parameters in our study population.
ASSUMPTION
Our study population is sufficiently representative of our target population such that inferences about the population parameters of our study population are generalizable to our target population.
Accuracy describes how similar a sample statistic is to the “true” population parameter
Precision describes how similar the sample statistics in a sampling distribution are to each other (i.e. the variability of the estimates)
Contingency Table for Accurate and/or Precise Outcomes
Reliable data \(\rightarrow\) sample statistics are accurate estimators of sample population parameters
Valid data \(\rightarrow\) sample statistics are accurate estimators of sampling distribution in study population
Generalizable data \(\rightarrow\) sampling distribution of study population is accurate estimator of sampling distribution in target population
Larger samples \(\rightarrow\) less variability \(\rightarrow\) more precise estimates
More representative samples \(\rightarrow\) less biased estimates \(\rightarrow\) more accurate estimates
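A small simulation can make the first of these two claims concrete. The sketch below is an illustration rather than part of the course materials: it assumes a hypothetical population that is uniform between 0 and 100, draws many samples at two sample sizes, and compares how much the sample means vary. The function name, seed, and sample sizes are arbitrary choices.

```python
import random
import statistics

def spread_of_sample_means(n, replicates=2000, seed=1220):
    """Standard deviation of the sample means from many simulated samples of size n."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(replicates):
        # Hypothetical population: values uniform between 0 and 100
        sample = [rng.uniform(0, 100) for _ in range(n)]
        means.append(sum(sample) / n)
    return statistics.stdev(means)

small = spread_of_sample_means(25)
large = spread_of_sample_means(400)
print(f"n = 25:  spread of sample means = {small:.2f}")
print(f"n = 400: spread of sample means = {large:.2f}")
```

The larger sample size should give a visibly smaller spread, which is what "more precise" means here. The simulation says nothing about bias: every sample is drawn fairly from the same population, so representativeness is assumed rather than tested.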
What do we mean when we say that sampling statistics and distributions are accurate and/or precise?
A confidence interval is a numerical range, calculated from a sample, that is expected to contain the population parameter with a given probability \(1-\alpha\) (alpha) across theoretical samples from a given population
\(1-\alpha\) is the confidence level and is often expressed as a %
This is only true if your assumptions about the population hold.
\(\alpha\) is called the confidence threshold
The population parameter is expected to fall outside the confidence interval with probability \(\alpha\)
\((\alpha * 100)\)% of confidence intervals for statistics from theoretical samples of this population will NOT contain the “true” population parameter
A.K.A. the Type I Error Rate or False Positive Rate
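The idea that \((\alpha * 100)\)% of intervals miss the "true" parameter can be checked by simulation. The following is a minimal sketch under stated assumptions: the population is normal with a hypothetical mean of 50 and a known standard deviation of 10, and each interval is built as sample mean \(\pm Z^* \times SE\). None of these numbers come from the course; they only serve to make the coverage visible.

```python
import random
from statistics import NormalDist

def coverage(confidence=0.95, mu=50.0, sigma=10.0, n=30, replicates=2000, seed=15):
    """Share of simulated confidence intervals that contain the true mean mu."""
    rng = random.Random(seed)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sigma / n ** 0.5  # sigma is treated as known (an assumption of this sketch)
    hits = 0
    for _ in range(replicates):
        x_bar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if x_bar - z_star * se <= mu <= x_bar + z_star * se:
            hits += 1
    return hits / replicates

rate = coverage()
print(f"Share of 95% intervals containing mu: {rate:.3f}")
print(f"Share missing mu (expected to be near alpha = 0.05): {1 - rate:.3f}")
```

Because the simulated data satisfy the distributional assumption exactly, the miss rate should sit close to \(\alpha\). With real data the coverage is only as good as the assumptions behind it.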
Point estimates are more precise than confidence intervals, but they are less likely to be accurate
Confidence intervals are more likely to be accurate than point estimates, but they are less precise
A point estimate describes the location of an estimate or parameter’s distribution
A confidence interval describes the scale of an estimate or parameter’s distribution
The confidence threshold describes our uncertainty regarding these values
Choosing a confidence threshold \(\alpha\) (alpha) is a trade-off between accuracy and precision.
As confidence increases (\(\alpha \to 0\)), accuracy increases
As confidence increases (\(\alpha \to 0\)), precision decreases
A weather forecast that is not very precise might accurately describe the weather on any given day, but it’s certainly not very informative.
Will a 95% confidence interval be wider (i.e. larger range) or narrower than a 90% confidence interval?
Wider
Which is a more precise estimator: a 95% or 90% confidence interval?
90% CI
Will a 95% confidence interval be wider (i.e. larger range) or narrower than a 99% confidence interval?
Narrower
Which is more likely to be an accurate estimator: a 95% or 99% confidence interval?
99% CI, if your assumptions hold
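The answers above can be checked numerically. This sketch assumes a normal sampling distribution and a hypothetical standard error of 0.016; the ordering of the widths does not depend on which positive standard error is used.

```python
from statistics import NormalDist

def margin_of_error(confidence, se):
    """Half-width of a confidence interval based on the standard normal distribution."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z_star * se

se = 0.016  # hypothetical standard error, chosen only for illustration
widths = {level: 2 * margin_of_error(level, se) for level in (0.90, 0.95, 0.99)}
for level, width in widths.items():
    print(f"{level:.0%} CI width: {width:.4f}")
```

Higher confidence levels should produce wider intervals, which is the accuracy and precision trade-off in numeric form.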
Properties of known distributions, like the 68-95-99.7 Rule, are used to calculate the bounds of a confidence interval.
A confidence interval is defined as \(\operatorname{point estimate} \pm \operatorname{margin of error}\)
\(\operatorname{margin of error}=Z^* \times SE\)
\(Z^*=\operatorname{Z-Score}_{\alpha / 2}\)
If our confidence level is \(1-\alpha = 0.90\), then \(\alpha=0.1\). \(Z^*=\operatorname{Z-Score}_{\alpha / 2}\) and \(\alpha / 2 = 0.05\), so \(Z^*=1.645\).
\(Z^*\) corresponds to the \(\operatorname{Z-Score}\) that leaves probability \(\alpha / 2\) in the upper tail of the standard normal distribution, i.e. the \(1 - \alpha / 2\) quantile.
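As a quick check of the \(Z^*\) lookup, here is a minimal Python sketch using `statistics.NormalDist` from the standard library. The helper name `z_star` is a hypothetical choice for this example.

```python
from statistics import NormalDist

def z_star(confidence):
    """Critical value Z* for a two-sided interval at the given confidence level."""
    alpha = 1 - confidence
    # The Z-score with alpha / 2 in the upper tail is the (1 - alpha / 2) quantile
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence -> Z* = {z_star(level):.3f}")
```

The same function can be called with any confidence level between 0 and 1, for instance to check a multiple-choice answer.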
Which of the following Z-scores is the appropriate \(Z^*\) for constructing a 98% confidence interval?
\(Z=2.05\)
\(Z = 1.96\)
\(Z = 2.33\)
\(Z = 1.64\)
\(Z = 2.33\), because \(1-\alpha = 0.98\) gives \(\alpha / 2 = 0.01\)
Facebook is trying to assess the performance of their news feed algorithm based on whether or not users feel they are seeing relevant content.
Their objective is to estimate the proportion of Facebook users who feel the algorithm works for them.
They took a random sample of American Facebook users and asked if they think Facebook accurately categorizes their interests.
569 users out of the 850 sampled (about 67%) said they felt the algorithm was accurate.
What’s the sample population?
What’s the study population?
What’s the target population?
We want to use reliable data from our sample to produce valid estimates of our study population distribution to make inferences that are generalizable to our target population.
\[ \begin{aligned} SE_p &= \sqrt{\frac{p(1-p)}{n}} \\ &= \sqrt{\frac{0.67(1-0.67)}{850}} \\ &= 0.016 \end{aligned} \]
For a 95% confidence interval, \(1 - \alpha = 0.95\) and \(\alpha = 0.05\), so \(\alpha / 2 = 0.025\). Therefore, \(Z^*_{0.95}=Z_{0.025}\approx 1.96\).
Our 95% confidence interval is defined by \(0.67 \pm 1.96 \times 0.016\).
A 95% confidence interval for the proportion of all Facebook users who are satisfied with their algorithm is (0.64, 0.70).
Lower bound: 0.6383894
Upper bound: 0.7016106
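The calculation above can be reproduced in a few lines. This is a sketch, not the code used to produce the slides: the function name `proportion_ci` is hypothetical, and it uses the rounded proportion 0.67 and the exact normal quantile rather than the rounded 1.96.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(p_hat, n, confidence=0.95):
    """Confidence interval for a proportion: p_hat +/- Z* x SE."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p_hat * (1 - p_hat) / n)  # standard error of a proportion
    margin = z_star * se
    return p_hat - margin, p_hat + margin

# Values from the Facebook example: rounded sample proportion and sample size
lower, upper = proportion_ci(p_hat=0.67, n=850)
print(f"95% CI: ({lower:.7f}, {upper:.7f})")
```

The printed bounds should agree with the lower and upper bounds listed above up to rounding.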
With 95% confidence, 64-70% of American Facebook users think Facebook categorizes their interests accurately…
Based on this study, with 95% confidence, we think the average percent of all Facebook users who are satisfied with their algorithm follows the distribution \(N(0.67, 0.016)\)…
…IF your assumptions are valid
Confidence intervals for means are calculated the same way as for proportions, but with the standard error of a mean calculation.
\[ SE_{\mu} = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}} \]
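To illustrate the mean case, the sketch below builds a 95% confidence interval from a sample using \(s / \sqrt{n}\) as the standard error. The data are simulated and entirely hypothetical (nightly hours of sleep for 50 people), and the sketch assumes the sample is large enough for \(Z^*\) from the normal distribution to be a reasonable choice.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_ci(sample, confidence=0.95):
    """Confidence interval for a mean: x_bar +/- Z* x (s / sqrt(n))."""
    n = len(sample)
    x_bar = mean(sample)
    se = stdev(sample) / sqrt(n)  # s stands in for the unknown sigma
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return x_bar - z_star * se, x_bar + z_star * se

# Hypothetical data: simulated hours of sleep, not taken from any real study
rng = random.Random(2024)
hours = [rng.gauss(7.0, 1.5) for _ in range(50)]

lower, upper = mean_ci(hours)
print(f"Sample mean: {mean(hours):.2f}")
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")
```

Only the standard error formula differs from the proportion case; the point estimate \(\pm\) margin of error structure is unchanged.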
DATA1220-55 Fall 2024, Class 15 | Updated: 2024-10-02